ABSTRACT

This year at TREC 2002 we participated in the adaptive
filtering sub-task of the filtering track with several models
for training a Rocchio classifier. Results were poorer than
average on the utility-type measures. Using simple feature
selection produced better than average results on an F-type
measure. The key elements of our approach were the use of
pseudo-judgments and a method for threshold updating. We also
participated in the batch filtering sub-task of the filtering
track, where we investigated the use of rank-based feature
selection techniques in conjunction with a very simple
classification rule.

1.  INTRODUCTION 

In the adaptive filtering sub-task of the filtering track,
systems start from a small set of documents which are
labelled either relevant or irrelevant. This is supplemented
by a training set, largely unlabelled, from which one may
draw inferences about the corpus, and may hazard some
conjectures as to the relevant documents. In the work reported
here, a simple version of “pseudo-relevance feedback” is used
to expand the terms appearing in the 3 relevant documents
and the original topic statement.

Our approach in preparing to study the problem of adaptive
filtering attempts to:

• Incorporate major techniques common to high-scoring
AF approaches in recent TRECs.

• Allow easy modification of aspects we are likely to be
doing experiments on.

• Be efficient enough to do many tuning runs.

• Be as simple as possible to implement given the above
constraints.

We  prepared  the  AP  1988-1990  data,  which  served  as  a 
“sandbox”  for  the  selection  of  parameters  in  the  adaptive 
Rocchio  model.  We  used  the  LEMUR  [10]  toolkit  to  manage 
the  text,  build  indices,  etc. 

We introduced a number of parameters controlling how
many examples will be pseudo-labelled and with what weights;
these are listed with the run labels in Section 3.

In the batch filtering sub-task of the filtering track,
systems utilize a training set consisting of documents which are
labelled either relevant or irrelevant for a given information
need to develop static classifiers which attempt to
distinguish the documents labelled relevant from those labelled
irrelevant. In our opinion, efforts to attack this problem are
often complicated by several characteristics of textual data.

Textual data is generally represented by using the terms
in the text as features. Such data is inherently
high-dimensional: the number of features is potentially equal to
the number of words in the English language. In addition,
textual data is noisy: it contains misspellings and improper
or colloquial uses of words, and many very common words
(e.g. “a”, “and”, “the”, etc.) are virtually useless for
distinguishing relevant documents from irrelevant ones,
regardless of the information need. Finally, most terms,
even those not considered “noise” under the previous
description, are not needed to distinguish relevant documents
from irrelevant documents.

The aforementioned characteristics of textual data
indicate that it might be possible to represent a document
collection using only a subset of the original feature set which
is much smaller than the original feature set, yet possesses
properties which serve to facilitate the process of
distinguishing relevant documents from irrelevant documents.

The idea we pursued in the batch filtering sub-task of the
filtering track was to employ a heuristic designed to generate
such a feature subset and then to train an extremely simple
classifier on the training set represented only in terms of the
selected feature set.

2.  ADAPTIVE  FILTERING:  BUILDING  A 
CLASSIFIER 

2.1 Initialization

Initial  training  for  each  topic  uses  the  training  set  (on 
which  relevance  status  with  respect  to  the  topic  is  known 
only  for  three  of  the  documents),  and  the  topic  description. 

FOR  EACH  Topic 

B1. Read topic description.

B2.  Read  initial  3  positive  training  examples  for  this 
topic.  Give  each  of  these  examples  a  weight  of  1.0. 

B3. Call the scoring model learning algorithm (Rocchio) to
produce a linear model based on the topic description and the
initial positive examples.

B4. Call the “pseudolabeling algorithm” to run the linear
model trained in Step B3 against all training documents.
It returns some portion of the training documents, each
with an associated positive or negative pseudolabel and a
fractional weight. Essentially, the documents that achieve
a high score on vector retrieval with the initial query and
3 positive documents are taken as “relevant”, and those
with a low score are taken as “not relevant”. Our algorithm
has a number of parameters that control (a) the dividing
line between “pseudo-relevant” and “pseudo-irrelevant”
documents (which we refer to, together, as the
“pseudo-labelled” documents), (b) the fractions of each class
that are sampled into the updated Rocchio classifier, and (c)
the weights that each type of “pseudo” document is
assigned in step B5. Ultimately these weights are expressed
in terms of the number of “equivalent documents” that the
pseudo-labelled documents represent.

B5. Call the classifier learning algorithm, which changes
the query, à la Rocchio, and selects a threshold that
maximizes the target score on the training set.
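
The pseudolabeling of step B4 can be illustrated with a minimal Python sketch. It is not the code used for the submitted runs: the function name `pseudo_label`, its arguments, and the choice to spread each class weight evenly over its documents are assumptions made for illustration, and the sampling of a fraction of each class is omitted.

```python
def pseudo_label(scores, n_pos, n_neg, pos_weight, neg_weight):
    """Pseudo-label the highest- and lowest-scoring training documents.

    scores: dict mapping doc_id -> score under the initial linear model.
    pos_weight / neg_weight: total number of "equivalent documents" that
    each pseudo class represents, spread evenly here (an assumption).
    Returns a dict mapping doc_id -> (label, weight), label in {+1, -1}.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_pos = min(n_pos, len(ranked))
    n_neg = min(n_neg, len(ranked) - n_pos)
    labelled = {}
    for doc_id in ranked[:n_pos]:                 # top of the ranking
        labelled[doc_id] = (+1, pos_weight / n_pos)
    for doc_id in ranked[len(ranked) - n_neg:]:   # bottom of the ranking
        labelled[doc_id] = (-1, neg_weight / n_neg)
    return labelled
```

Documents in the middle of the ranking receive no pseudolabel, which corresponds to returning only "some portion" of the training documents.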

Numerous  implementation  details  are  not  described  here. 
Note  that  the  idea  of  an  “outer  loop”  over  topics  represents 
just  one  way  to  approach  the  problem,  which  may  not  be 
optimal  for  specific  choices  of  the  learning  algorithm. 

For further analytical work we have since modified the
code to save the classifiers or the internal state of the
training algorithm to persistent storage after initial training. This
will potentially be useful for multiple experiments with the
same starting point, as well as for comparative experiments
(studying improvement of the classifier over time).

2.2  Adaptive  Phase  of  Training 

In adaptive filtering, we run through the test documents
in the specified order, applying classifiers, obtaining judgments
only for documents that the classifier submits as relevant,
and updating the classifiers.

FOR  EACH  Topic 

FOR  EACH  Test  Document 

C1. Apply current classifier for topic to test document,
computing score and determining if score is above threshold.

IF  score  is  >  threshold 

C2.1. Pass document ID, topic ID, score, and label
(“relevant”) to routine that writes output for evaluation.

C2.2. Pass document ID and topic ID to judging
routine, which will return label (Relevant vs. Nonrelevant vs.
Unjudged).

ELSE 

C3.1  Label  =  Unknown 

C4.  Pass  current  classifier,  document  ID,  Label,  and  a 
weight  of  1.0  to  Learner  (which  for  the  baseline  will  be  an 
object  that  in  turn  calls  Rocchio  and  TROT). 
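
As a rough illustration of steps C1 to C4 for a single topic, the loop can be sketched in Python as follows. The four callables (`score`, `judge`, `update`) and the convention that the learner returns the new threshold are hypothetical stand-ins, not the interfaces of our implementation.

```python
def adaptive_filter(stream, score, threshold, judge, update):
    """Run one topic over a document stream and return submitted documents.

    stream: iterable of (doc_id, doc) in the prescribed order.
    score(doc) -> float.
    judge(doc_id) -> "relevant" | "nonrelevant" | "unjudged".
    update(doc_id, label, weight) -> new threshold; the learner is free
    to retrain its model internally. All callables are hypothetical.
    """
    submitted = []
    for doc_id, doc in stream:
        s = score(doc)                            # C1: apply classifier
        if s > threshold:
            submitted.append((doc_id, s))         # C2.1: write for evaluation
            label = judge(doc_id)                 # C2.2: request a judgment
        else:
            label = "unknown"                     # C3.1: no feedback available
        threshold = update(doc_id, label, 1.0)    # C4: pass to the learner
    return submitted
```

Feedback is requested only inside the `if` branch, so the learner never sees a true label for a document that was not submitted.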


2.3 The Rocchio Algorithm

The  Rocchio  algorithm  [9]  produces  a  linear  model,  which 
must  then  be  specified  with  a  threshold.  The  basic  inputs 
to  the  algorithm  are: 

1.  An  initial  “query”  vector 

2.  A  set  of  document  vectors.  Each  vector  is  accompanied 
by  a  weight  and  a  label. 

3. The Rocchio weighting parameters ($\alpha$, $\beta$, $\gamma$)

4.  Feature  weighting  parameters 

5.  Feature  selection  parameters  and  rules 

The Rocchio algorithm [9, 4] is a batch algorithm. It
produces a new weight vector $w$ from an existing weight vector
$w_1$ and a set of training examples. The $j$th component $w_j$
of the new weight vector is:

$$
w_j = \alpha w_{1,j}
      + \beta \frac{\sum_{i \in C} x_{i,j}}{n_C}
      - \gamma \frac{\sum_{i \notin C} x_{i,j}}{n - n_C}
\tag{1}
$$

where $n$ is the number of training examples,
$C = \{1 \le i \le n : y_i = 1\}$ is the set of positive training
examples (i.e., members of the class of interest), and $n_C$ is
the number of positive training examples. The parameters
$\alpha$, $\beta$, and $\gamma$ control the relative impact of the
original weight vector, the positive examples, and the negative
examples, respectively.

Typically, classifiers produced with the Rocchio algorithm
are restricted to having nonnegative weights, so that instead
of using the raw $w$ from Equation (1), one uses $w'$ where

$$
w'_j = \begin{cases} w_j & \text{if } w_j > 0, \\ 0 & \text{otherwise.} \end{cases}
$$

This is turned into a classifier by the relatively
expensive process of recomputing the threshold after each new
judgment is received on a submitted document. The
computation of the threshold can be somewhat accelerated with
a full Rocchio model, but we have not found a way to
accelerate it meaningfully when a non-linear step such as the
selection of a number of “top features” is included.


2.4  Retaining  only  the  top  30  terms  in  a  query 

To  improve  performance,  we  limited  the  number  of  terms 
appearing  in  a  query. 

The  specific  algorithm  is  given  in  pseudocode  as 


Algorithm 1: Query term selection
Require: query vector Q, number of terms k
  for t ∈ Q do
    if t < 0 then
      t = 0
    end if
  end for
  S = reverse(sort(Q))
Ensure: S[1 : min(|S|, k)], the top k positive components
of Q
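
Algorithm 1 translates almost directly into Python. The sketch below assumes the query is stored as a dictionary from term to weight (a representation chosen here for illustration) and drops zero-weight terms rather than keeping them with weight zero.

```python
def select_query_terms(query, k):
    """Keep at most k terms with the largest strictly positive weights.

    query: dict mapping term -> weight.
    """
    positive = {t: w for t, w in query.items() if w > 0}
    ranked = sorted(positive.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:k])
```

With k = 30 this corresponds to the setting named in the section title.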


3.  ADAPTIVE  TRAINING  HEURISTICS 

To find a Rocchio classifier we started at “plausible”
values for all of the parameters in the model, and conducted a
“greedy” search on each of the parameter values separately.

Original results concentrated on the utility based
measures, and were terrible. This led to the development of
a “TREC-specific” feature, which stops sending examples
for judgment if the rate of success falls too low. One such
heuristic is to stop when the number of consecutive negatives
exceeds the total accumulated positive judgments obtained.
Such heuristics have no meaning in the real world situations
to which adaptive filtering will be applied.
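
One reading of the stopping heuristic just described is sketched below. Whether the initial training positives count toward the accumulated total is not stated, so the sketch exposes this as a hypothetical `initial_positives` argument.

```python
def should_give_up(judgments, initial_positives=0):
    """Decide whether submission stops at some point in the sequence.

    judgments: booleans in submission order (True = judged relevant).
    Returns True once the current run of consecutive negatives exceeds
    the total number of positives accumulated so far.
    """
    positives = initial_positives
    consecutive_negatives = 0
    for relevant in judgments:
        if relevant:
            positives += 1
            consecutive_negatives = 0
        else:
            consecutive_negatives += 1
        if consecutive_negatives > positives:
            return True
    return False
```

With no credited positives, a single negative judgment already triggers the rule, which shows how aggressive the heuristic is early in a run.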

An alternative heuristic, which can be justified for real
applications, is to reduce the number of components in the
updated Rocchio vector to a very small number. In one
such run the number of components is reduced to 30. These
30 components are selected on the basis of their individual
explanatory power with regard to the specific measure of
performance under consideration. In the submitted run,
this was an F-measure.

Since F measures can be rewritten as

$$
F = \frac{1}{\beta/P + (1-\beta)/R},
$$

where $P$ is precision and $R$ is recall, they are
very sensitive to finding any relevant documents. If $(g, G)$
are the numbers of relevant documents (found, in the
collection) respectively, and $n$ documents are returned,
then $P = g/n$, $R = g/G$, and

$$
F = \frac{1}{\beta n/g + (1-\beta) G/g} = \frac{g}{\beta n + (1-\beta) G}.
$$

So a system that “hangs in there” and eventually produces
even a single relevant document will score better than a more
discriminating system that returns no relevant documents,
and quits sooner.

The results of our early experiments show only that we
have set up a workable laboratory for exploring a host of
possible combinations of the five key ingredients of an
adaptive algorithm: a compression rule, a representation
rule, a matching scheme, a learning scheme, and a fusion or
selection scheme for combining multiple approaches to each
of these components. As is well known in the information
retrieval community, the adaptive filtering task is extremely
difficult, but we are optimistic that previously unexplored
combinations of approaches may yield meaningful improvements
in performance. The results are shown in Table 2, which
appears at the end of the paper. The meaning of the row and
column labels is as follows.

1. label of the run, which is composed of 3 parts: the
value of the weight of the unjudged documents (parameter
thres.unjWt; U-xx → thres.unjWt=xx), followed by the name
of the parameter that is changed and the utility that is
optimised (for example “best.f” means that the f-beta
utility is optimised)

The parameter related labels have the following meanings:

• A+ : $\alpha$ = 2.0

• A- : $\alpha$ = 0.5

• C+ : $\gamma$ = 0.25

• C- : $\gamma$ = 0.0625

• ND+ : neg density s.t. 2000 pseudo-negatives are
selected

• ND- : neg density s.t. 500 pseudo-negatives are
selected

• PD- : pos density = 0.5, corresponding to 10
pseudo-positive documents

• PW+ : pos weight of 5

• PW- : pos weight of 1

• NW+ : neg weight of 10

• NW- : neg weight of 2

• def : default values, $\alpha$ = 1.0, $\beta$ = 1.0,
$\gamma$ = 1/8, ND s.t. 1000 pseudo-negative documents are
selected, PD s.t. 20 pseudo-positive documents are
selected, PW = 2, NW = 5

2. the number of topics that obtained a positive score in
the test

3. min score: the lowest topic score

4. the total score (sum of all topic scores)

5. average T11U score

6. average T11F score

7. average T11SU score

8. the number of topics in this run that found at least 1
positive doc

9. the number of topics in this run that found at least 3
positive docs

10. the number of topics for which at least one document
was sent to the oracle

11. the “giveup threshold” for which these results were
obtained
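
The sensitivity of the F measure to finding any relevant document, discussed earlier in this section, can be checked numerically. The sketch below uses the rewritten form of F in terms of $g$, $n$ and $G$; the default $\beta = 0.5$ and the counts in the usage note are illustrative values, not figures from our runs.

```python
def f_measure(g, n, G, beta=0.5):
    """F = g / (beta * n + (1 - beta) * G); taken as 0 if the denominator is 0.

    g: relevant documents found, n: documents returned,
    G: relevant documents in the collection.
    """
    denom = beta * n + (1 - beta) * G
    return g / denom if denom > 0 else 0.0
```

With G = 20, a system that returns 200 documents and finds a single relevant one scores 1/110, while a system that quits after 10 documents without finding any scores exactly 0.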

3.1  Ratio  Based  Scoring 

In order to provide variety, we also used an alternate
scoring scheme in which documents are ordered by a measure
of the ratio of their similarities to the centroids of the
positive and negative examples. Thus it builds on the relevance
feedback information available to Rocchio, with a key
difference: scores are calculated using the (regularized) ratio
of distances between normalized vectors. Specifically, if $p$, $n$
are the unit vectors corresponding to the centroids of the
positive and negative examples, and $d$ is the unit vector
corresponding to the document being scored, then

$$
sc(d) = \frac{1 - \langle n, d \rangle}{1 - \langle p, d \rangle}
\tag{2}
$$

If the denominator vanishes, the value $10^6$ is used as a
default.

In practice this was more effective with a “Quitting” rule
that cuts off submission if, after the first 50 documents are
submitted, we have not achieved a positive utility score.
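
The ratio score can be sketched as follows, under the assumption that Equation (2) has the form $(1 - \langle n, d \rangle)/(1 - \langle p, d \rangle)$ with inner products of unit vectors. The helper names are ours, and dense list vectors are assumed for brevity.

```python
import math

def unit(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ratio_score(doc, positives, negatives, default=1e6):
    """Score doc by its distance to the negative vs. the positive centroid."""
    p = unit(centroid(positives))
    n = unit(centroid(negatives))
    d = unit(doc)
    denom = 1.0 - dot(p, d)
    if denom <= 0.0:              # document coincides with the positive centroid
        return default
    return (1.0 - dot(n, d)) / denom
```

A document equidistant from both centroids scores 1, and scores grow without bound as the document approaches the positive centroid, which is why a default value is needed.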


4.  BATCH  FILTERING:  BOOLEAN  MODEL 

Assume that there are $n > 0$ distinct terms in the
document collection and associate an index in $V = \{1, 2, \ldots, n\}$
with each of these terms. Letting $B = \{0, 1\}$, we represent
each document in the collection as an n-dimensional Boolean
vector $x \in B^n$. Each component of $x$ corresponds to one of the
distinct terms in the document collection, with $x_i = 1$ if the
ith term is present in the document and $x_i = 0$ if the ith
term is absent from the document.

For a subset $S \subseteq V$ and vector $a \in B^V$, we shall let
$a[S] \in B^S$ denote the projection of $a$ onto $S$, and for
$X \subseteq B^V$ we shall write $X[S]$ for the projection of $X$
on $S$, that is, $X[S] = \{a[S] \mid a \in X\}$. For a subset
$S \subseteq V$ let us denote by $\chi^S \in B^n$ its
characteristic vector, i.e.

$$
\chi^S_j = \begin{cases} 1 & \text{if } j \in S, \\ 0 & \text{otherwise.} \end{cases}
$$


We shall refer to the set of relevant documents as $T$ and
the set of irrelevant documents as $F$ and shall assume that
$T \cap F = \emptyset$, that is, there do not exist vectors
$a \in T$ and $b \in F$ such that $a = b$.

A set $S \subseteq V$ is said to be a support set for $T$ and $F$
if it has the property that $T[S] \cap F[S] = \emptyset$. That is, $S$
is a support set if each relevant document represented in
terms of the selected feature subset can be distinguished
from each irrelevant document represented in terms of the
selected feature subset.
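
The support set property can be tested directly from its definition. The sketch below represents documents as tuples of 0/1 values and, unlike the text, uses 0-based term indices; both choices are ours.

```python
def project(doc, S):
    """Projection of a Boolean vector (tuple of 0/1) onto the index set S."""
    return tuple(doc[j] for j in sorted(S))

def is_support_set(T, F, S):
    """True if no relevant and irrelevant document coincide on the terms in S."""
    projected_T = {project(a, S) for a in T}
    projected_F = {project(b, S) for b in F}
    return projected_T.isdisjoint(projected_F)
```

For T = [(1, 0, 1), (1, 1, 0)] and F = [(0, 0, 1), (0, 1, 0)], the single term {0} is a support set, whereas {1, 2} is not, since both classes project onto the same two vectors.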

The document model described above does not preserve
information about the order in which terms appear in the
document and therefore is often referred to as the
bag-of-words representation. In addition, the Boolean nature of
this representation lies in contrast to a popular
representation known in the information retrieval literature as the
vector space model, in which the components $x_i$ correspond
to the (relative) frequency of the term in the document.


4.1  Measure  of  Separation 

For a subset $S \subseteq V$, we measure the distance between the
projections $T[S]$ and $F[S]$ of the sets $T$ and $F \subseteq B^n$ onto
$B^S$ by the so-called average Hamming distance. The use of
Hamming distance based separation, rather than measures
based on the $\ell_1$, $\ell_2$ or $\ell_\infty$ norms, as is often the
practice when employing the vector space model, is suggested by
the Boolean nature of our document model.

The Hamming distance between the vectors $a[S] \in T[S]$
and $b[S] \in F[S]$ is defined as
$d_S(a, b) = \sum_{j \in S} |a_j - b_j|$. The
average Hamming distance between the sets $T[S]$ and $F[S]$
then is defined as

$$
\Delta_{avg}(S) = \frac{1}{|T||F|} \sum_{a \in T} \sum_{b \in F} d_S(a, b).
\tag{3}
$$
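
The average Hamming distance can be computed by brute force over all relevant-irrelevant pairs, as in the following sketch. The tuple representation of documents and the 0-based indices are assumptions of the example.

```python
def hamming(a, b, S):
    """Number of indices in S on which the Boolean vectors a and b differ."""
    return sum(abs(a[j] - b[j]) for j in S)

def average_hamming(T, F, S):
    """Average of hamming(a, b, S) over all pairs a in T, b in F."""
    total = sum(hamming(a, b, S) for a in T for b in F)
    return total / (len(T) * len(F))
```

For T = [(1, 0, 1), (1, 1, 0)] and F = [(0, 0, 1), (0, 1, 0)], the four pairwise distances over all three terms are 1, 3, 3 and 1, giving an average of 2.0.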


4.2  Ranking  Functions 

For each $i \in V$, each of the ranking functions presented
here utilizes the following four values

• $a_i$ = the number of relevant documents containing the
ith term

• $b_i$ = the number of irrelevant documents containing
the ith term

• $c_i$ = the number of relevant documents which do not
contain the ith term

• $d_i$ = the number of irrelevant documents which do not
contain the ith term


For each $i \in V$, the relationship between $a_i$, $b_i$, $c_i$ and $d_i$
and the document collection is given by the following 2x2
contingency table

$$
\begin{array}{c|cc|c}
        & x \in T         & x \in F         &                               \\ \hline
x_i = 1 & a_i             & b_i             & a_i + b_i = \theta_i          \\
x_i = 0 & c_i             & d_i             & c_i + d_i = \bar{\theta}_i    \\ \hline
        & a_i + c_i = |T| & b_i + d_i = |F| & m
\end{array}
$$

where the marginals $\theta_i$ and $\bar{\theta}_i$ represent the number of
documents containing the ith term and the number of documents
which do not contain the ith term respectively, and $y \in B$ is
defined as

$$
y = \begin{cases} 1 & \text{if } x \in T, \\ 0 & \text{otherwise.} \end{cases}
$$

Obviously, the marginals $|T|$ and $|F|$ are constant for all
terms while the marginals $\theta_i$ and $\bar{\theta}_i$ vary for each term.
The total number of documents in the collection is
$m = a_i + b_i + c_i + d_i$, which is obviously also a constant.

For simplicity of notation, we shall view all ranking
functions as functions of the four parameters $a$, $b$, $c$ and $d$,
though clearly there are only two independent values among
these.

In  [2]  we  analyzed  and  compared  a  number  of  possible 
ranking  functions,  and  based  on  that  study,  we  selected  5 
such  functions  for  this  TREC  experiment: 

Function $\alpha$

$$
\alpha = \left| \frac{a}{a + c} - \frac{b}{b + d} \right|
       = \frac{|ad - bc|}{|T||F|}
\tag{4}
$$

is the absolute value of the difference between the number of
relevant-irrelevant document pairs in the training collection
which provide evidence that the ith term is a good classifier
of relevant documents and the number of relevant-irrelevant
document pairs which provide evidence that the ith term is
a good classifier of irrelevant documents, normalized by the
total number (i.e. both correctly distinguished and
incorrectly distinguished) of relevant-irrelevant document pairs.

Function $\beta$

$$
\beta = \frac{ad + bc}{(a + c)(b + d)}
      = \frac{ad + bc}{ab + ad + bc + cd}
      = \frac{ad + bc}{|T||F|}
$$

is the total number of relevant-irrelevant document pairs
correctly distinguished by the ith term, normalized by the
total number of relevant-irrelevant document pairs in the
training collection.

Function $\gamma$

$$
\gamma = \frac{ad}{(a + c)(b + d)} = \frac{ad}{|T||F|}
$$

is an obvious variant of both $\alpha$ and $\beta$.

Function $\delta$

$$
\delta = \frac{|ad - bc|}{\sqrt{(a + b)(c + d)(a + c)(b + d)}}
       = \frac{|ad - bc|}{\sqrt{\theta \bar{\theta} |T||F|}}
\tag{6}
$$

is the absolute value of the Pearson Product Moment
Correlation coefficient, or simply the correlation coefficient, for
the Boolean variables $x_i$ and $y$ as defined above. It
measures the degree to which these two variables have a linear
relationship.

Function $\rho$

$$
\rho = \frac{(a + b + c + d)(ad - bc)^2}{(a + b)(c + d)(a + c)(b + d)}
     = \frac{m (ad - bc)^2}{\theta \bar{\theta} |T||F|}
$$

is the $\chi^2$ statistic for the Boolean variables $x_i$ and $y$ as
defined above and provides another measure of association
for these two variables.

Note that $\rho$ is a monotone function of $\delta$ (indeed
$\rho = m \delta^2$), so that our procedure, as described below,
effectively gives a “double weight” to this particular measure
of effectiveness.
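
For concreteness, the five ranking functions of this section can be computed from the contingency counts of a single term as in the sketch below. The function name and the returned dictionary are our own choices, and all four marginals are assumed to be non-zero.

```python
import math

def ranking_functions(a, b, c, d):
    """Return the five scores for one term from its 2x2 contingency counts.

    Assumes every marginal (a+b, c+d, a+c, b+d) is non-zero.
    """
    size_T, size_F = a + c, b + d          # |T| and |F|
    theta, theta_bar = a + b, c + d        # docs with / without the term
    m = a + b + c + d
    pairs = size_T * size_F                # relevant-irrelevant pairs
    alpha = abs(a * d - b * c) / pairs
    beta = (a * d + b * c) / pairs
    gamma = a * d / pairs
    delta = abs(a * d - b * c) / math.sqrt(theta * theta_bar * pairs)
    rho = m * (a * d - b * c) ** 2 / (theta * theta_bar * pairs)
    return {"alpha": alpha, "beta": beta, "gamma": gamma,
            "delta": delta, "rho": rho}
```

For a = 8, b = 2, c = 2, d = 8 this gives $\alpha = 0.6$, $\beta = 0.68$, $\gamma = 0.64$, $\delta = 0.6$ and $\rho = 7.2$, and one can check that $\rho = m\delta^2 = 20 \times 0.36$.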


4.3  Training  the  Batch  Classifier 

This  section  describes  the  feature  selection  method  and 
the  simple  classifier  used  in  the  batch  filtering  sub-task  of 
the  filtering  track. 

The set of unique terms in the training set $T \cup F$ was
ranked by each of the five ranking functions
$\alpha, \beta, \gamma, \delta, \rho$ described in §4.2. Five
intermediate feature sets, $S_\alpha$, $S_\beta$, $S_\gamma$,
$S_\delta$, $S_\rho$, were constructed using the top ranking
$K = 50$ terms of the corresponding ranking functions. Letting

$$
\tilde{S} = S_\alpha \cup S_\beta \cup S_\gamma \cup S_\delta \cup S_\rho
$$

we assigned a score $\psi \in \{1, \ldots, 5\}$ to each of the terms in
$\tilde{S}$, defined as the number of sets $S_\xi$,
$\xi \in \{\alpha, \beta, \gamma, \delta, \rho\}$,
in which the term appeared. The final feature set $S$ was
constructed by selecting the $K = 50$ terms with the highest
$\psi$ scores.
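
The voting step can be sketched as follows. The input format (one score dictionary per ranking function) and the alphabetical tie-breaking are assumptions of the sketch; the text above does not say how ties in the vote count were resolved.

```python
from collections import Counter

def vote_select(rankings, k):
    """Select k terms by counting appearances in each function's top-k list.

    rankings: dict mapping ranking-function name -> {term: score}.
    Ties, both within a ranking and in the vote count, are broken
    alphabetically here (an assumption).
    """
    votes = Counter()
    for scores in rankings.values():
        top = sorted(scores, key=lambda t: (-scores[t], t))[:k]
        votes.update(top)                  # one vote per top-k appearance
    ordered = sorted(votes, key=lambda t: (-votes[t], t))
    return ordered[:k]
```

In our runs k corresponds to K = 50 and there are five ranking functions, so each vote count lies between 1 and 5.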

Next, to each term $i \in S$ we assigned the weight

$$
w(i) = \frac{(a_i + 0.5)/(a_i + c_i + 1)}{(b_i + 0.5)/(b_i + d_i + 1)},
$$

which can be seen to be the Bayesian weight of evidence,
and to each document $y \in T[S] \cup F[S]$ we assigned the score

$$
Q(y) = \sum_{j \in S} \log(w(j)) \, y_j.
$$

That is, each document projected onto the selected feature
set $S$ is assigned a score equal to the sum of the logarithms
of the Bayesian weights of evidence for the terms it contains.

The batch filtering task requires the definition of a static
classification rule which specifies whether each document in
the test set should be considered relevant and retrieved, or
irrelevant and ignored. The rule we utilized specifies that
$y \in T[S] \cup F[S]$ will be retrieved if and only if $Q(y) > r$ for
some $r \in \mathbb{R}$. The threshold $r$ was selected so as to optimize
the utility measure TU11 $= 2R - I$ over the training set,
where $R$ is the number of relevant documents retrieved by
the system and $I$ is the number of irrelevant documents
retrieved by the system.
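
The scoring and threshold selection can be illustrated with the short sketch below. It takes the term weights as given, tries each observed training score as a candidate threshold, and treats "retrieve nothing" as the fallback with utility 0; these choices, and the function names, are assumptions of the example rather than a description of our code.

```python
import math

def document_score(doc_terms, weights):
    """Sum of log-weights of the selected terms present in the document.

    doc_terms: set of terms in the document; weights: dict term -> w > 0.
    """
    return sum(math.log(w) for term, w in weights.items() if term in doc_terms)

def best_threshold(scored):
    """Pick the threshold r maximizing 2R - I when retrieving score > r.

    scored: list of (score, is_relevant) pairs from the training set.
    Returns (r, utility); (inf, 0) means retrieving nothing is best.
    """
    candidates = [float("-inf")] + sorted({s for s, _ in scored})
    best = (float("inf"), 0)
    for r in candidates:
        R = sum(1 for s, rel in scored if s > r and rel)
        I = sum(1 for s, rel in scored if s > r and not rel)
        if 2 * R - I > best[1]:
            best = (r, 2 * R - I)
    return best
```

When no threshold yields a positive utility on the training set, the sketch falls back to retrieving nothing.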

5.  RESULTS 

5.1  Filtering  Results 

Our training results, using a variety of scoring measures, for
a great variety of training runs, are shown at the end of the
paper in Table 2. In the final analysis, our results at TREC
were in the middle of the pack.

These are summarized in Table 1.


Method          Mean T11
dimacsddl02a    0.110
dimacsllaAPQ    0.142
dimacsddl02b    0.293
dimacsllaP1Q    0.272
dimacslla30Q    0.337

Table 1: TREC 2002 Results for the Assessor topics,
various runs

The best results were achieved by the run submitted as
dimacslla30Q. This was a Rocchio method, trained on a
set of documents similar to the one used at TREC. The
“30” indicates that only the top 30 terms, that is, the 30
terms with highest weight in the updated query vector, were
included. “AP” indicates that only the terms with positive
weight are retained. “Q” indicates that for our final submission
we cut off submission if we did not achieve a positive score
after submitting 50 documents for judgment. “P1Q” used a
ratio scoring scheme, together with the “quit at 50 if score
is negative” rule. This is, of course, a “TREC strategy”
and not a procedure that would be useful in a real world
application.

We have subsequently learned that with proper
learning parameters, as chosen by the group from the Chinese
Academy of Sciences, it is possible for a Rocchio approach
similar to ours to achieve very good results. We are not
certain as to which steps of our approach blocked us from
realizing this high level of performance. One possibility is
that even the small number of pseudo-negative cases that we
introduce into the training is sufficient to keep us away from
the region of good performance. Another is that the space
of parameters is too large, and the dependence of the
learning too complex, to be successfully explored “one variable
at a time”, which was essentially the heuristic used. Other
inhibiting factors may have included the heuristics used to
cut off submission if we did not achieve a positive score after
the first 50 judgments. Nonetheless, our submission that did
use this heuristic fared better than those that did not.

5.2  Batch  Results 

On the assessor judged topics our TU11 score was less
than the median fifteen times, equal to the median twenty
times, and greater than the median fifteen times; it never
attained the maximum.

On the intersection topics our TU11 score was less than
the median once, equal to the median six times, and greater
than the median forty-three times; it attained the
maximum twenty-one times. Unfortunately, for many of these
topics, submitting no documents at all was an effective TREC
strategy.

6.  CONCLUSIONS 

This work is part of a larger effort to develop an array
of approaches to filtering problems, and to integrate or fuse
them for greater effectiveness. In this first effort it would
appear that we have adopted tools that are capable of “state of
the art” performance on the adaptive filtering task, but have
not yet learned how to ensure that this level of performance
is achieved.